Skip to content

fix(code-review): retry broken sandboxes#3110

Merged
alex-alecu merged 13 commits into
mainfrom
heartbreaking-ragamuffin
May 12, 2026
Merged

fix(code-review): retry broken sandboxes#3110
alex-alecu merged 13 commits into
mainfrom
heartbreaking-ragamuffin

Conversation

@alex-alecu
Copy link
Copy Markdown
Contributor

Why

Sometimes a code review can fail because the machine it is using breaks. When that happens, the review should get one clean second try instead of staying failed.

What changed

Code reviews now remember which sandbox and attempt they are using. When cloud-agent-next confirms a sandbox failed and destroys it, the web app claims affected reviews only once and starts them again in a fresh attempt. Old updates from the first attempt are ignored, so they cannot overwrite the retry. The same pull request check is reused and shows that the review is trying again.

How to test

  1. Run pnpm drizzle migrate.
  2. Run pnpm test -- "apps/web/src/app/api/internal/code-review-status/[reviewId]/route.test.ts" apps/web/src/app/api/internal/code-review-sandbox-destroyed/route.test.ts apps/web/src/lib/code-reviews/sandbox-retry.test.ts apps/web/src/lib/code-reviews/db/code-reviews.test.ts apps/web/src/lib/code-reviews/dispatch/dispatch-pending-reviews.test.ts apps/web/src/routers/code-reviews-router.test.ts.
  3. Run pnpm --filter kilo-code-review-worker test.
  4. Run pnpm --filter cloud-agent-next test -- sandbox-recovery.test.ts session-prepare.test.ts.
  5. Run pnpm --filter @kilocode/worker-utils test.

Comment thread apps/web/src/app/api/internal/code-review-status/[reviewId]/route.ts Outdated
Comment thread packages/db/src/migrations/0117_new_jasper_sitwell.sql Outdated
@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented May 7, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (7 files, incremental)
  • apps/web/src/app/api/internal/code-review-status/[reviewId]/route.test.ts
  • apps/web/src/app/api/internal/code-review-status/[reviewId]/route.ts
  • packages/db/src/schema-types.ts
  • packages/worker-utils/src/cloud-agent-next-client.ts
  • services/code-review-infra/src/code-review-orchestrator.ts
  • services/code-review-infra/src/types.ts
  • services/code-review-infra/test/integration/code-review-orchestrator.test.ts
Changes in this increment

The PR now moves sandbox-500 retry responsibility into the Durable Object (CodeReviewOrchestrator) rather than the web-app status callback. Key changes reviewed:

  • isCloudAgentNextSandboxInternalServerError — broadened from prepareSession-only to any procedure. The hasSandboxSignal && hasInternalServerSignal double-guard combined with the prior 5xx HTTP status check is a reasonable heuristic.
  • tryRetryFreshSessionAfterSandboxError — guarded by sandboxRetryAttempted flag (persisted to DO storage) which prevents infinite retry loops. Correct.
  • runFreshCloudAgentNextFallback — wraps runWithCloudAgentNext in try/catch to handle unexpected throws. Correct.
  • getSessionHealth preflight before sendMessageV2 in runWithCloudAgentNextFollowup — avoids sending follow-up to an unhealthy sandbox. Good defensive addition.
  • terminalReason now included in getStatus() response and setLocalTerminalStateFromDB — fixes a gap where DO lost the terminal reason on DB-terminal early exit.
  • sandbox_error added to CODE_REVIEW_TERMINAL_REASONS and CloudAgentTerminalReason — consistent with the new retry-at-DO-level approach.
  • Web-app route and dispatch layers simplified (attempt-tracking and sandbox-retry dispatch removed from web app, as the DO now owns retry).
Resolved Findings (from prior rounds)
  • apps/web/src/app/api/internal/code-review-status/[reviewId]/route.ts — sandbox retry now only skips terminal cleanup when a retry was actually claimed (resolved in 6dc8fef)
  • packages/db/src/migrations/0117_new_jasper_sitwell.sql — sandbox ID index now uses CREATE INDEX CONCURRENTLY (resolved in d58cf11)

Reviewed by claude-sonnet-4.6 · 8,315,607 tokens

@alex-alecu
Copy link
Copy Markdown
Contributor Author

Manual test passed.

Tested:

  • Started local Postgres, Next.js, code-review worker, and a local cloud-agent-next stub; dispatched a v2 follow-up review that hit a sendMessageV2 sandbox 500 and retried through a fresh session.
  • Dispatched a v2 fresh review where prepareSession returned sandbox 500 twice to verify retry exhaustion.
  • Posted an old-session status callback after a fresh session was persisted to verify stale callback handling.

Verified:

  • Follow-up flow called getSessionHealth, updateSession, sendMessageV2, then fresh prepareSession and initiateFromKilocodeSessionV2; the review completed with the fresh session and CLI session IDs persisted.
  • Retry exhaustion attempted prepareSession exactly twice, did not initiate a failed prepared session, and persisted failed with terminalReason=sandbox_error in DB and Durable Object status.
  • Stale old-session callback returned the stale-session success response and left persisted review state running on the current session.

@alex-alecu alex-alecu merged commit 7204baf into main May 12, 2026
39 checks passed
@alex-alecu alex-alecu deleted the heartbreaking-ragamuffin branch May 12, 2026 06:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants